1. Data Sources & Collection

CMS Nursing Home Provider Information - https://data.cms.gov/provider-data/dataset/4pq5-n9py (2017 - 2021)

CMS Medicare Claims Quality Measures - https://data.cms.gov/provider-data/dataset/ijh5-nb2v (April 2020 - March 2021)

CMS Survey Summary - https://data.cms.gov/provider-data/dataset/tbry-pc2d (2017 - 2020)

CMS MDS Quality Measures - https://data.cms.gov/provider-data/dataset/djen-97ju (April 2020 - March 2021)

CMS COVID-19 Nursing Home Data - https://data.cms.gov/covid-19/covid-19-nursing-home-data (April 2020 - December 2020, January 2021 - March 2021)

2. Construct dataset

3. Data Cleaning

4. Data Exploration & Pre-processing

Both the Total nursing staff turnover and Registered Nurse turnover attributes have a large but acceptable number of nulls (roughly 1,000-2,000 rows each).

Distributions

On visual inspection, ~15 plots are approximately normally distributed; however, a number of attributes are not normally distributed or have notably long right or left tails. This is addressed further in assumption testing.

Outlier Detection

Since the data was manually reported by nursing homes regarding their performance, a cautious approach will be taken for outlier detection and removal.

On visual analysis, the number of outliers varies by attribute; however, the majority of attributes have outliers in only one tail, resulting in skew. To quantify this, the skewness is calculated below:

There is a significant amount of skew in 9 numeric attributes. To address this, right-skewed attributes with a skew greater than 3 will have values above the 90th percentile replaced by the median. Left-skewed attributes with a skew less than -3 will have values below the 10th percentile replaced by the median. Categorical dummy variables are excluded.
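The replacement rule described above can be sketched as follows. This is a minimal illustration rather than the original notebook code: the function name `replace_skewed_tails`, its parameters, and the toy data are assumptions, and the skewness estimator is whatever `pandas.Series.skew` computes, which may differ from the one used in the analysis.

```python
import pandas as pd

def replace_skewed_tails(df, columns, skew_limit=3.0, upper_q=0.90, lower_q=0.10):
    """Replace extreme tail values with the column median for heavily skewed columns."""
    out = df.copy()
    for col in columns:
        skew = out[col].skew()      # sample skewness of the column
        median = out[col].median()  # median taken before any replacement
        if skew > skew_limit:
            # Right-skewed: values above the 90th percentile become the median.
            cutoff = out[col].quantile(upper_q)
            out.loc[out[col] > cutoff, col] = median
        elif skew < -skew_limit:
            # Left-skewed: values below the 10th percentile become the median.
            cutoff = out[col].quantile(lower_q)
            out.loc[out[col] < cutoff, col] = median
    return out

# Hypothetical toy data: one heavily right-skewed column and one symmetric column.
toy = pd.DataFrame({
    "right_skewed": [1.0] * 19 + [1000.0],
    "symmetric": [float(i) for i in range(20)],
})
cleaned = replace_skewed_tails(toy, ["right_skewed", "symmetric"])
```

Only numeric columns should be passed in `columns`; dummy variables are left out of that list, in line with the exclusion stated above.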

Dataset Normalization

From the data description, it is apparent that the scale of the attributes varies significantly, so the data must be normalized.

Dataset is normalized to [0,1]
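A min-max rescaling to [0, 1] might look like the sketch below. The helper name `min_max_normalize`, the column names, and the values are illustrative assumptions; the original analysis may have used a library scaler instead.

```python
import pandas as pd

def min_max_normalize(df):
    """Rescale every column to [0, 1]; constant columns are mapped to 0."""
    col_min = df.min()
    # Guard against division by zero for columns with a single distinct value.
    col_range = (df.max() - col_min).replace(0, 1)
    return (df - col_min) / col_range

# Hypothetical columns on very different scales.
toy = pd.DataFrame({"beds": [50.0, 100.0, 150.0], "rating": [1.0, 3.0, 5.0]})
normalized = min_max_normalize(toy)
```

When a train/test split is used, the minimum and range are usually taken from the training rows only and then applied to the test rows, so that no information leaks from the test set.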

5. Assumption Testing - Linear Regression

Normality of Predictor Distributions

As stated above, ~15 plots are approximately normally distributed; however, a number of attributes are not normally distributed or have notably long right or left tails. The variables that are not normally distributed will be log transformed to try to address the skew.

Log Transformation of independent variables

After transformation, distributions remain non-normal for multiple attributes.
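For reference, a log transformation of selected predictors can be sketched as below. The use of `np.log1p` (that is, log(1 + x)) is an assumption made here so that zero values stay defined; the function name and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def log_transform(df, columns):
    """Apply log(1 + x) to the given non-negative columns and return a copy."""
    out = df.copy()
    out[columns] = np.log1p(out[columns])
    return out

# Hypothetical right-skewed predictor next to a column that is left alone.
toy = pd.DataFrame({
    "skewed": [1.0, 2.0, 3.0, 5.0, 100.0],
    "untouched": [1.0, 2.0, 3.0, 4.0, 5.0],
})
transformed = log_transform(toy, ["skewed"])

skew_before = toy["skewed"].skew()
skew_after = transformed["skewed"].skew()
```

Comparing `skew_before` with `skew_after` shows how much the right tail was compressed. A smaller skew does not by itself make a distribution normal, which is consistent with the result reported above.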

Linearity

On visual inspection, linearity between the dependent variable and the independent variables does not exist for any attribute.

Pearson's correlation coefficients between each independent variable and the dependent variable are low, which is consistent with a general lack of linear relationships.
Based on this, the assumption of linearity is not fulfilled, and the regression model may not be able to explain the data well.
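The Pearson check against the dependent variable can be sketched as follows. The function name, column names, and data are hypothetical. The toy data is built so that one predictor is perfectly linear in the target and another is related to it only through a curve.

```python
import pandas as pd

def correlation_with_target(df, target):
    """Pearson correlation of every predictor with the target, strongest first."""
    predictors = df.drop(columns=[target])
    return predictors.corrwith(df[target]).sort_values(key=abs, ascending=False)

# Hypothetical data: x_linear tracks the target exactly; x_curved is a symmetric
# quadratic of it, so the relationship exists but is not linear.
toy = pd.DataFrame({
    "x_linear": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_curved": [4.0, 1.0, 0.0, 1.0, 4.0],
    "target": [2.0, 4.0, 6.0, 8.0, 10.0],
})
correlations = correlation_with_target(toy, "target")
```

A coefficient near zero only rules out a linear association. The curved predictor in this toy example is fully determined by the target yet has a correlation of essentially zero, which is why the scatter plots are inspected alongside the coefficients.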

Multicollinearity

Most attributes are uncorrelated or only weakly correlated.
Number of Certified Beds and Average Number of Residents per Day are the only attributes that were strongly correlated (> 0.9).
Additionally, it is suspected that the dummy variables within Ownership Type and within Long-Stay QM are highly correlated with one another.
Three pairs were moderately correlated: Registered Nurse turnover and Total nursing staff turnover (0.67); COVID-19 deaths per occupied beds and confirmed COVID-19 cases per occupied beds (0.64); and Percentage of long-stay residents whose need for help with daily activities has increased and Percentage of long-stay residents whose ability to move independently worsened (0.57). This will be checked through VIF.

Using a VIF threshold of 5, most variables are uncorrelated or only mildly correlated. As previously established, Number of Certified Beds and Average Number of Residents per Day are highly correlated. Therefore, Number of Certified Beds will be dropped from the dataset. Additionally, one of the dummy variable attributes will be dropped for each of Ownership Type and Long-Stay QM Rating.

Using a VIF threshold of < 5, the multicollinearity assumption is satisfied once the following attributes are dropped: 'Number of Certified Beds', 'Ownership Type_2', 'Long-Stay QM Rating_5.0'.
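As an illustration of the VIF check, the sketch below computes the variance inflation factor directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing attribute j on all other attributes. It is not the original notebook code, which may have used a library routine; the function name and the synthetic columns are assumptions.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF for each column, regressing it on all other columns plus an intercept."""
    vifs = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=[col]).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(df)), others])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1.0 - resid.var() / y.var()  # R^2 of the auxiliary regression
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)

# Synthetic stand-ins: "residents" is nearly proportional to "beds".
rng = np.random.default_rng(0)
beds = rng.uniform(20, 200, size=200)
toy = pd.DataFrame({
    "beds": beds,
    "residents": 0.8 * beds + rng.normal(0, 3, size=200),
    "turnover": rng.uniform(0, 1, size=200),
})
vifs = vif_table(toy)
vifs_after_drop = vif_table(toy.drop(columns=["beds"]))
```

In this synthetic case the two collinear columns exceed the threshold of 5 together, and dropping either one brings the remaining values back under it, which mirrors the decision to drop Number of Certified Beds.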

Normality of Error terms

Based on preliminary analysis of the regression model, some general observations can be made:

On visual analysis, model residuals seem approximately normal in distribution, with some right skew.

The Q-Q plot also shows the right skew.

The Jarque-Bera test is significant, with a test statistic of 13859, which confirms that the model residuals have enough skewness and kurtosis to differ significantly from a normal distribution. Thus, the model fails this assumption.
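To make the test statistic concrete, the sketch below computes Jarque-Bera from its textbook definition, JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis of the residuals. The p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2). The function name and the synthetic residuals are assumptions, not the model's actual residuals.

```python
import numpy as np

def jarque_bera(residuals):
    """Return (JB statistic, p-value, skewness, kurtosis) for a residual vector."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = np.exp(-jb / 2.0)  # chi-square survival function, 2 d.o.f.
    return jb, p_value, skew, kurt

# Synthetic right-skewed "residuals" (shifted exponential), for illustration only.
rng = np.random.default_rng(0)
skewed_resid = rng.exponential(size=2000) - 1.0
jb, p_value, skew, kurt = jarque_bera(skewed_resid)
```

Because the statistic scales with the sample size n, large datasets can produce very large values even for moderate departures from normality, so it is worth reading it together with the histogram and Q-Q plot.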

Autocorrelation of Error Terms

The Durbin-Watson test for autocorrelation of the error terms/residuals gives a statistic of 1.9, which is close to 2 and indicates little to no autocorrelation in the model. The errors can be treated as independent.
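The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch, with a hypothetical function name and toy residuals chosen to show the extremes:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Toy residual sequences illustrating the range of the statistic.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative autocorrelation
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # strong positive autocorrelation
```

The statistic ranges from 0 to 4: values near 2 indicate no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. It depends on the row order of the data, so for cross-sectional data such as facility-level records it is only meaningful if that order carries information.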

Homoscedasticity

Based on the Breusch-Pagan test (where value 1 is the Lagrange multiplier statistic, and value 2 is the p-value), heteroscedasticity is present within the model. Thus, the residuals are not distributed with equal variance, which means that the results of the regression analysis may not be reliable. To attempt to address this, the dependent variable will be log transformed.

Based on the repeated Breusch-Pagan test (where value 1 is the Lagrange multiplier statistic, and value 2 is the p-value), heteroscedasticity is still present within the model after the log transformation. Thus, the residuals still do not have equal variance, which means that the results of the regression analysis may not be reliable.

Assumption testing summary:

- Normality of predictor distributions: Failed after log transformation of predictor variables <br>
- Linearity of independent and dependent variables: Failed <br>
- Multicollinearity: Passed after removing attributes with high correlation using VIF <br>
- Normality of error terms: Failed due to right skew of error terms<br>
- Autocorrelation of error terms: Passed based on the Durbin-Watson test<br>
- Homoscedasticity: Failed based on the Breusch-Pagan test<br>

The results of assumption testing show that linear regression may not be the ideal method for this dataset. Regardless, linear regression will be conducted on the data, and the effect of the failed assumptions will be considered when interpreting the performance of the model.

Model Building

Note the following general observations:

Stepwise regression will be performed to increase the stability of the model. With alpha = 0.05, coefficients with a p-value > 0.05 are removed one at a time, and the model is refitted after each removal.

Model 3: Removing 'Percentage of long-stay residents who lose too much weight' attribute results:

Model 4: Removing 'Number of Citations from Infection Control Inspections' attribute results:

Model 5: Removing 'Total Number of Health Deficiencies' attribute results:

Model 6: Removing 'COVID-19 Deaths Per Occupied Beds' attribute results:

Model 7: Removing 'Percentage of long-stay residents experiencing one or more falls with major injury' attribute results:

Model 8: Removing 'Percentage of long-stay residents who were physically restrained' attribute results:

Model 9: Removing 'Adjusted Nurse Aide Staffing Hours per Resident per Day' attribute results:

Model 10: Removing 'Confirmed COVID-19 Cases Per Occupied Bed' attribute results:

Model 11: Removing 'Registered Nurse turnover' attribute results:

Model 12: Removing 'Ownership Type_1' attribute results:

Model 13: Removing 'Percentage of long-stay residents assessed and appropriately given the seasonal influenza vaccine' attribute results:

Model 14: Removing 'Percentage of long-stay residents assessed and appropriately given the seasonal influenza vaccine' attribute results:

Model 15: Removing 'Number of Facility Reported Incidents' attribute results:

Model 16: Removing 'Total Number of Fire Safety Deficiencies' attribute results:

All remaining p-values are now significant. The attributes that are important to prediction accuracy have been retained.

Based on the final model, the attributes with the largest effects include:
- Average Number of Residents per Day: A one-unit increase in Average Number of Residents per Day is associated with a 0.144 decrease in the Emergency Department Visit Rate (coefficient -0.144).
- Long-Stay QM Rating (1.0): Having the lowest (1 out of 5) Long-Stay QM Rating is associated with an increase in the Emergency Department Visit Rate.
- Percentage of long-stay residents whose need for help with daily activities has increased: A one-unit increase in the percentage of long-stay residents who need additional help is associated with a 0.1155 decrease in the Emergency Department Visit Rate (coefficient -0.1155).
- Percentage of long-stay residents who received an antipsychotic medication: A one-unit increase in the percentage of residents receiving an antipsychotic medication is associated with a 0.1043 decrease in the Emergency Department Visit Rate (coefficient -0.1043).

Model Testing

Because the dependent variable is continuous, regression metrics are used to evaluate the model.

Model Evaluation & Validation

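Mean Absolute Error (MAE) measures the average absolute difference between predicted and actual values. An MAE of 0.045 indicates that predictions are, on average, within 0.045 of the actual values.
R^2 measures the proportion of variance in the dependent variable that is explained by the model. It is currently 27%, which means the model explains only 27% of the variation in the Emergency Department Visit Rate and leaves 73% unexplained.
The Root Mean Squared Error (RMSE) summarizes the typical size of the residual errors and penalizes large errors more heavily than MAE. The reported value of 0.003 is low; however, RMSE can never be smaller than MAE, so a value of 0.003 alongside an MAE of 0.045 suggests that this figure may be the Mean Squared Error rather than its square root.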
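The three metrics can be computed from their definitions as in the sketch below. The function name and the toy predictions are hypothetical and are unrelated to the model's actual results.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical actual and predicted values on a [0, 1] scale.
metrics = regression_metrics([0.1, 0.2, 0.3, 0.4], [0.1, 0.25, 0.3, 0.3])
```

For any set of predictions, RMSE is at least as large as MAE, and the two are equal only when every absolute error is the same. Comparing them gives a rough sense of how much a few large errors dominate.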

The residual errors were generally centered around 0.